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ABSTRACT 

Motivation: Staining the mRNA of a gene via in situ hybridization 
(ISH) during the development of a Drosophila melanogaster 
embryo delivers the detailed spatio-temporal patterns of the gene 
expression. Many related biological problems such as the detection 
of co-expressed genes, co-regulated genes and transcription factor 
binding motifs rely heavily on the analysis of these image patterns. 
To provide the text-based pattern searching for facilitating related 
biological studies, the images in the Berkeley Drosophila Genome 
Project (BDGP) study are annotated with developmental stage term 
and anatomical ontology terms manually by domain experts. Due to 
the rapid increase in the number of such images and the inevitable 
bias annotations by human curators, it is necessary to develop 
an automatic method to recognize the developmental stage and 
annotate anatomical terms. 

Results: In this article, we propose a novel computational model 
for jointly stage classification and anatomical terms annotation 
of Drosophila gene expression patterns. We propose a novel 
Tri-Relational Graph (TG) model that comprises the data graph, 
anatomical term graph, developmental stage term graph, and 
connect them by two additional graphs induced from stage or 
annotation label assignments. Upon the TG model, we introduce 
a Preferential Random Walk (PRW) method to jointly recognize 
developmental stage and annotate anatomical terms by utilizing the 
interrelations between two tasks. The experimental results on two 
refined BDGP datasets demonstrate that our joint learning method 
can achieve superior prediction results on both tasks than the 
state-of-the-art methods. 

Availability: http://ranger.uta.edu/%7eheng/Drosophila/ 
Contact: |heng@uta.edu| 



1 INTRODUCTION 

The mRNA in situ hybridization (ISH) provides an effective way to 
visualize gene expression patterns. The ISH technique can precisely 
document the localization of gene expression at the cellular level 
via visualizing the probe by colorimetric or fluorescent microscopy 
to allow the production of high quality images recording the 
spati al location and intensity of the gene expression ( Fowlkes 
et a/. j2008|-|Hendriks et a/.U2006l:li:ecuver et a/.Ll2007l: Megason 
and Fraser, I2007I) . Such spatial and temporal characterizations of 
expressions paved the way for inferring regulatory networks based 
on spatio-temporal dynamics. The raw data produced from such 
experiments includes digital images of the Drosophila embryo 
(examples are visualized in Fig. [T} showing a particular gene 
expression pattern revealed by a gene- specific probe (Grumbling 
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fly Drosophila melanogaster is one of the most used model 
in developmental biology. 
Traditionally, such ISH images are analyzed directly by the 
of microscope images and available from well-known 
ases, such as the Berkeley Dros ophila Genome Project (BDGP) 
expressi on pattern database (Tomancak et all I2002L I2007I) 
FISH (iL'ecuver et alUlOOm . To facilitate spatio-temporal 
gene expression pattern studies, researchers needed 
two challenging tasks first: Drosophila gene expression 
stage recognition (temporal descriptions) and anatomical 
ation (spatial descriptions). As shown in Figure Q] Drosophila 
genesis has been subdivided into 17 embryonic stages. These 
are defined by promi nent features that are dis tinguishable in 
Drosophila embryos dWeigmann et all\2003\) . To recognize 
ges of the Drosophila, embryos provide their time course 
On the other hand, the Drosophila gene expression patterns 
rec orded by controll e d voc abularies from the biologist's 
(iTomancak et all l2002h . Such anatomical ontology 
describe the spatial biological patterns and often cross stages, 
is more, because the ISH images are attached to each other 
:tively becoming bags of images, the corresponding stage label 
as anatomical controlled terms are the descriptions of the 
^ group of images instead of each individual image inside 
A Drosophila embryo ISH image bag belongs to only 
s tage, but has multiple related anatomical terms. Previously, 
two tasks are tackled by domain experts. However, due to 
Did increase in the number of such images and the inevitable 
annotation by human curators, it is necessary to develop an 
method to classify the developmental stage and annotate 
mical structure using controlled vocabulary. 
Rdcently, a lot of research works have been proposed to solve 
sbove two problems. They considered the stage recognition 
single-label multi-class classification problem while the 
annotation was treated as a mu lti-label multi-class 
fication problem. (iKumar et al[ 120021) first developed an 
enclosing algorithm to find the embryo outline and 
:t the binary e xpress ion pa tterns via adaptive t hresholding. 
and Mverl l2004h and dPeng et all 120071) developed 
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new ways to represent ISH images based on Gaussian mixture 
models, principal component analysis and wavelet functions. 
Besides that, they utilized min-redundancy max-relevance to do 
the feature selection and automatically c lassify gene exp r ession 
parte n developmental stages. Recently, dPunivani et all l2010l) 
constructed a system (called SPEX 2 ) and concluded that the 
local regression (LR) method taking advantage of the controlled 
term- -term interactions can get the best enhanced anatomical 
controlled term annotation results. The LR method was proposed 
by Ji et al. and developed based on their previous works 
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Fig. 1. Examples of Drosophila embryo ISH images and associated 
anatomical annotation terms in the stages 4-6, 7-8, 9-10, 11-12 and 13-16 
in the BDGP database. The darker stained region highlights the place where 
the gene is expressed. The darker color the region has, the higher the gene 
expression level is 

Jji et a/.L.2008[l201Ql : lLi ^^Zll2009l : IShuiwang et a/.Ll2009.) . All of 

the above methods have provided new inspirations and insights 
for classifying or annotating Drosophila gene expression patterns 
captured by ISH. However, none of them considered doing those two 
tasks simultaneously. As we know, intuitively, anatomical controlled 
vocabulary terms provide evidence for the stage label and vice versa. 
For example, the early stage range is more likely annotated with the 
controlled terms such as 'statu nascendi' and 'celluar' than the terms 
'embryonic' and 'epidermis'. Therefore, besides the image-stage 
and image-annotation relationships which have been well studied 
and applied in the previous research, it is necessary to take advantage 
of the correlations between stage classes and annotation terms. 

In this article, we propose a novel Tri-Relational Graph (TG) 
model that comprises the data graph, anatomical controlled terms 
graph, developmental stage label graph to jointly classify the 
stage of images and annotate anatomical terms simultaneously. 
Upon the TG model, we introduce a Preferential Random 
Walk (PRW) method to simultaneously produce image-to- stage, 
image- to- annotation, image-to-image, stage-to-image, stage-to- 
annotation, stage-to- stage, annotation-to-image, annotation- to- stage 
and annotation- to- annotation relevances to jointly learn the salient 
patterns among images that are predictive of their stage label 
and anatomical annotation terms. Our method achieves superior 
developmental stage classification performance and anatomical 
terms annotation results compared with the state-of-the-art methods. 

We consider each image bag as a data point and extract the bag- 
of-word features that are widely used in computer vision research as 
the corresponding descriptors. Since the real object is 3D and each 



image can only provide 2D observation from a certain perspective, 
we integrate the bag-of-word features for different views. We 
summarize our contributions as follows: 

(1) This article is the first one to propose a novel solution to 
the questions 'What is the developmental stage?' and 'What 
are the anatomical annotations' simultaneously, given an 
unlabeled image bag. 

(2) Via the new TG model that we constructed, the relationships 
between stage label and anatomical controlled terms as well 
as the correlations among anatomical terms can be naturally 
and explicitly exploited by the graph-based semi- supervised 
learning methods. 

(3) We propose a new PRW method to seek the hidden 
annotation-annotation and annotation-stage relevances. 
Other than only using image-to-image relevance conducted 
by existing methods, we can directly predict the stage label 
and annotate anatomical controlled terms for unknown image 
bags. 

2 DATA DESCRIPTORS 

As we known, the Drosophila embryos are 3D objects. However, 
the corresponding image data can only demonstrate 2D information 
from a certain view. Since recent study has shown that incorporating 
images from different views can im prove the classification 
performance consistently dji et ail 120081) . we will use the images 
taken from multiple views instead of one perspective as the data 
descriptor. We only consider the lateral, dorsal and ventral images 
in our experiment due to the fact that the number of images taken 
from other views is much less than that of the above three views. All 
the images from BDGP database have been pre-processed, including 
alignment and resizing to 128 x 320 gr ay images. F or the sake of 
simplicity, we extract the popular SIFT (lLowell2004l) features from 
the regular patches with the radius as well as the spacing as 16 pixels 
dShuiwang et ail Eo09h . which is shown in Figure |2] Specifically, 
we extract one SIFT descriptor with 128 dimensions on each patch 
and each image is represented by 133 (7x 19) SIFT descriptors. 
Nevertheless, the above SIFT features cannot be directly used to 
measure similarity between data points (image bags), because the 
number of images in each image bag is different. In order to get a 
desired equal length descriptor to release the burden of later learning 
task, we need to build codebook for all extracted SIFT features first 
and then redo the data representations for each image bag based on 
the constructed codebook. 

2.1 Codebook construction 

Usually the codebook is established by conducting the clustering 
algorithms on a subset of the local features, and the cluster centers 
are then chosen as the visual words of the codebook. In our study, 
we use ^-means to do the clustering on the training image bags. 
Since the result of ^-means depends on the initial centers, we repeat 
it with 10 random initializations from which the one resulting in 
the smallest objective function value is selected. The number of 
clusters is set to 1000, 500 and 250 for lateral, dorsal and ventral 
images, respectively, according to the total number of images for 
each view as shown in Table Q] (Other codebook sizes gave similar 
performance.) 
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Fig. 2. Demonstration of the regular patches. We extract one SIFT feature 
on one patch, where the radius and spacing of the regular patches are set to 
16 pixels 

Table 1. The statistics summary of the refined BDGP images with 79 terms 



Stage range 4-6 7-8 9-10 11-12 13-16 Total 



y. = {y c J ,y? T ] T e {0, \ } K c+ K a . Without loss of generality, we assume the first 
/ < n image bags are already labeled, which are denoted as T = {x;, y f } j =1 . Our 
task is to learn a function/ :X -> {0, \ } K c+ K a from T that is able to classify an 
unlabeled data point x/(/ + 1 < i < n) into one stage class in C and to annotate 
it with a number of anatomical terms in A at the same time. For simplicity, we 
write Y c = [y c l ,---,y c n ], F a = [yf , ••• ,y£], and F = [y : ••• ,y n ]. As introduced 
in Section 1, the stage class and anatomical terms have some relations. We 
utilize the following affinity matrix to model their interrelations, R e R Kc xKa , 
where R(iJ) indicates how closely class C; and term a/ are related. In this 
work, we compute it as 

R(i,j) = cos(y? ,yj)=<?i,yf>/( || y] || || yj || ) (D 

where y^ is the i-th row of F c and y? is the y'-th row of Y a . Throughout this 
article, we denote a vector as a bold lowercase character and a matrix as an 
uppercase character. We denote the i-th entry of a vector v as v(i), and the 
entry at the z'-th row and y'-th column of a matrix M as M(i,j). \ \\\ \ denotes 
the Euclidian norm of vector v. And the inner product of two vector vi and 
V2 is defined as < vi , \2 > = v\ V2 • 



Size of control term 
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12 


12 


20 


31 


79 


No. of image bags 


500 


500 


500 


500 


500 


2500 


No. of lateral images 


1514 


812 


727 


1356 


1004 


5414 


No. of dorsal images 


226 


324 


431 


447 


724 


2152 


No. of ventral images 


164 


137 


81 


214 


216 


812 



3.1 The construction of TG 



2.2 Data (image bag) representations 

After we get three codebooks, the images in each bag are quantized 
separately for each view. Features computed from patches on images 
with a certain view are compared with the visual words in the 
corresponding codebook and the visual word closest to the feature 
in terms of Euclidean distance is utilized to represent it. Therefore, 
if an image bag encompasses the images from three views, then 
it could be represented by three bags of words, one for each 
view. We concatenate the three vectors so that the images with 
different views (lateral, dorsal and ventral) in one bag can be 
represented by one vector. To be specific, Let eR 1000 ,x^ eR 500 
and x v e R 250 denote the bag-of- words vector for images in a bag 
with lateral, dorsal and ventral view, respectively. The descriptor for 
this image bag can be represented as x = [x^;x^;x v ] eR 1750 . Since 
not all the image bags enclose the images from all three views, the 
corresponding bag-of- words representation is a vector of zeroes if a 
specific view is absent. Moreover, in order to capture the variability 
of the number of images in each view and each bag, we normalized 
the bag-of-words vector to unit length. At last, each image bag is 
represented by a normalized vector x. 



3 METHODS 

In this section, we first construct a TG to model Drosophila gene expression 
patterns followed by proposing a novel PRW method. Using PRW on TG, we 
jointly make stage classification and annotate anatomical terms of Drosophila 
gene expression patterns. 

For the Drosophila gene expression pattern data, we have n gene 
expression images bags X = {xi , • • • ,x n }, where each image bag is abstracted 
as a data point xteW. Each data point X; belongs to one of K c stage 
classes C = {c\ , • • • , ck c ) represented by y^ e {0, \} Kc , such that y c t (k) = 1 if 
Xt is classified into class Ck, and 0 otherwise. Meanwhile, each image 
bag Xi is also annotated with a number of anatomical ontology terms 
A={a\, ••• ,aK a } represented by y?e{0, l} Ka , such that y?(k) = l if x; is 
annotated with term at, and 0 otherwise. Also, for convenience, we write 



Given the dataset X , pairwise similarity Wx eR nxn between data points can 
be computed using the Gaussian kernel function, 



W x (iJ) = 



exp(— x/— Xj 
0, 



/2a 2 , 
otherwise 



i+3 



(2) 



where the vector x is calculated using bag-of-word features for one image 
bag. Regarding \}c^^w^^^^^^^^^^^^^^^^^c^^a^^t\\i\k- 
Manor and Perona T2004l) . WV (i, i) indicates how closely x, and x,- are related. 
From Wx , a graph Qx = (Vx , £x) can be induced, where Vx =X and Ex ^ 
Vx x Vx- And we use kNN graph. To be specific, we connect x;,xy if one 
of them is among the other's k nearest neighbor and define the value of 
the edge connecting them by Equation J3- Because Qx characterizes the 
relationships between data points, it is usually called as data graph, such 
as the middle sub graph in Figure [3] Existing graph-b ased semi-supervised 
learning methods ( iKang et all 120061 : Izha et a/.ll2008h only make use of the 
data graph, on which the class label information is propagated. 

Different from conventional single-label classification learning problem in 
which classes are mutual exclusive, the anatomical terms are interrelated with 
one another. Again, we resort to cosine similarity to calculate the controlled 
term affinity matrix Wa, where Wa(iJ) indicates the correlation between 
a* and ay. Thus, a graph Q a = (Va,£a) is induced, where Va = A and Sa ^ 
Va x Va • We call Qa as annotation label subgraph, which is shown as the 
right subgraph in Figure |3] Similarly, stage classification label graph shown 
as the left subgraph in Figure [3] Q c = (y c ,£ c ) can be constructed from 
stage classification labels, where Vc =C and ^c^VcxVc, where we define 
the value of the edge connecting two stage labels as Wc(i,j) = \Sbi — Sty \\ F , 
where || \\ F means Frobenius norm and denotes the between class scatter 
matrix for stage i. Connecting Qx and Qa by the annotation associations via 
the green dashed lines, connecting Qx and Qc by the class associations via 
the blue dashed lines and connecting Qc and Qa by the stage-term association 
via the purple dashed lines, we construct a TG as following: 



G = {VxUVcUVa,£xU£aUSxcUSxaUEcaX 



(3) 



which is illustrated in Figure[3] Obviously, the subgraph Qax = (Vx , Va , £xa) 
connects Gx and Ga, whose adjacency matrix is Fj. Similarly, the 
adjacency matrix of Qcx = (Vx , Vc , £xc) is • The subgraph (Vc , Va , £ca) 
characterizes the associations between stage classes and anatomical terms 
whose adjacency matrix is R defined in Equation (fl. 

In contrast to existing graph-based semi-supervised learning methods that 
only use information conveyed by Qx , we aim to simultaneously classify and 
annotate an unlabeled data point using all the information encoded in Q. Since 
all data points (gene expression image bags), stage terms and annotation 
terms are equally regarded as vertices on Q, our task is to measure the 
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Fig. 3. The TG constructed from the gene expression data. Solid lines indicate affinity between vertices within in a same subgraph, dashed lines indicates 
associations between vertices in two different subgraphs 



relevance between a class/anntotation term vertex and a data point vertex. 
As each class/annotation term has a set of associated training data points, 
which convey the same biological record information as the class/annotation 
term, we consider both a class/annotation term vertex and its labeled training 
image bag vertices as a group set, 



G*=<fcU{x/|y/(*) = l} 



(4) 



which is illustrated as the vertices with orange boundary in [3] As a result, 
instead of measuring vertex-to-vertex relevance between a class/annotatation 
term vertex and an unlabeled data point vertex, we may measure the set-to- 
vertex relevancebetwe^rUhegro^ 



and Paz ^]<)9$b^Ton^UiM 

random walk and use its equilibrium probability to measure the relevance 
between a group set and an unlabeled data point. 

3.2 Preferential random walk 

Standard random walk on a graph W can be described as a Markov process 
with transition probability M =D~ l W, where d/ = J^.W(i J) is the degree 
of vertex i and D = diag(&\, ••• ,d n ). Clearly, M T and J^.M (£,j) = l. 
When W is symmetric, it corresponds to an undirected graph. When W 
is asymmetric, it corresponds to a directed graph and d; is the out degree 
of vertex i. Let be the distribution of the random walker at time t, 
the distribution at t + 1 is given by p (t+1) (j) = J2iV (t) (0M(iJ). Thus, the 
equilibrium (stationary) distribution of the random walk p*=p^ = °°) is 
determined by M r p* =p*. It is well known that the solution is simply given 
by p* = We/(J2idi) = d/(J2i d/), where d = [di , • • • , d n f . 

It can be seen that the equilibrium distribution of a standard random walk 
is solely determined by the graph itself, but independent of the location where 
the random walk is initiated. In order to incorporate label information, we 
propose the following PRW: 



p('+D (/) = ( 1 - a ) J2 . P (0 (i)M (i J) + ahj , 



(5) 



where 0 ^ a ^ 1 is a fixed parameter, and h, called preferential distribution, 
is a probability distribution such that h(z')^0 and J]-h(0 = l. Equation Q 
describes a random walk process in which the random walker hops on 



the graph W according to the transition matrix M with probability I— a, 
and meanwhile it takes a preference to go to other vertices specified by h 
with probability a. The equilibrium distribution of PRW in Equation {5j is 
determined by p* = (1 — a)M T \)* +ah, which leads to: 



p* = a [I-(l -a)M r ] _1 h. 



(6) 



Due to Perron-Frobenius theorem, the maximum eigenvalue of M is less than 
maxi^jM (i,j) = 1. Thus, / — (1 —a)M T is positive definite and invertible. 
Equation J5j takes a si milar form to two existing works: random walk with 
resta rt (RWR) method fibng et q/.U2006h and PageRank algorithm ( Brin and 
Page J 19981) . In the former, h is a vector with all entries to be 0 except one 
entry to be 1 indicating the vertex where the random walk could be restarted; 
whil e in the latter, h is a constant vector called as damping factor ( Brin and 



Page J 19981) . In contrast, the preferential distribution vector h in Equation J5) 
is a generic probability distribution, which is flexible thereby more powerful. 
Most importantly, through h we can assess group-to-vertex relevance, while 
RWR and PageRan k methods measur e vertex-to -vertex relevance. 

Similar to RWR iTong et <3/ll2006l) , when we set the h to be a probability 
distribution in which all the entries are 0 except for those corresponding to 
Qk> P*(0 measures how relevant the k-th group is to the i-th vertex on Q. 

3.3 Preferential random walk on TG 

In order to classify and annotate unlabeled data points using the equilibrium 
probabilities in Equation 0 of the PRW on TG, we need to construct the 
transition matrix M and the preferential distribution h from Q. 
Construction of the transition matrix M: 
Let 



M — 



M x M X c Mxa 
M C x M c M C a 
Max M A c M a 



(7) 



where Mx, Mc and Ma are the intrasubgraph transition matrices of Qx, 
Qc and Qa respectively, and the rest six sub-matrices are the intersubgraph 
transition matrices among Qx, Qc and Qa- Let fi\ e [0, 1] be the jumping 
probability, i.e. the probability that a random walker hops from Qx to Qc 
and vice versa. And let /?2 e [0, 1] be the jumping probability from Qx to 
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Qa or vice versa. Therefore, (3\ and f>i regulates the reinforcement between 
Gx and one of the other two subgraphs. When both $\ = 0 and p2 = 0, the 
random walk are performed independently on Gx, which is equivalent to 
existing graph-based semi-supervised learning methods only using the data 
graph Gx • Similarly, we define X as the jumping probability from Gc to Ga 
or vice versa. 

During a random walk process, if the random walker is on a vertex of the 
data subgraph which has at least one connection to the label subgraph, such 
as vertex xi in Figure[3] she can hops to the class label or annotation subgraph 
with probability $\ or annotation subgraph with probability fo, or stay on 
the data subgraph with probability 1 — /3i — and hop to other vertices of 
the data subgraph. If the random walker is on a vertex of the data subgraph 
without a connection to the class label or annotation subgraph, she stays on 
the data subgraph and hops to other vertices on it as in standard random walk 

process. To be more precise, let d i c = 2_J/ the transition probability 

from x/ to cj is defined as following: 



p(Cj\xi)=M X c(iJ) = 



9iYT(ij)/d;° , < c >0, (g) 
0, otherwise. 



Similarly, let d i c = ^2jY c (i,j), the transition probability from C; to xy is: 

Following the same definition, the rest four inter- subgraph transition 
probability matrices are defined as: 



p(SLj\Xi)=MxA(i,j) = 



p(x j \3L i ) = M AX (iJ) = 



p 2 Yj(i,j)/dJ a , ifdf«>0, 
0, otherwise. 



/3 2 Y a (i,j)/d; a , ifdf a >0, 
0, otherwise. 



where df a = £. Fj(zj) and df a = £. Y a (i,j), and 



p(Cj\sLi)=M AC (iJ) = 



XR T (i,j)/df, ifdf T >0, 
0, otherwise. 



, - v a » / • *\ f XR(iJ)/df, if df >0, 
f\ j\ u cav ,jj J 0 ^ otherwise 



(10) 



(11) 



(12) 



(13) 



where df T =J2jR T (iJ) and df = J2jR(iJ). 

Let df = £ • W X (i J), df = £ • Y(iJ), df a = £ . Q a (iJ), df c = £ . Q c (iJ) 
where Q a =R+ Y a and Q c =R T + Y c . 

The data subgraph intra transition probability from x/ to xj is computed as: 



p(Xj\x i )=M x (i,j) = 



Wx (ij)/df , otherwise 



(14) 



Construction of the preferential distribution H : the preferential distribution 
vector specifies a group of vertices to which the random walker prefers 
to moving in every iteration step. The relevance between this group and 
an vertex is measured by the equilibrium distribution of the random 
walk process. Therefore, we construct K = K C +K a preferential distribution 
vectors, one for each semantic group G^: 



h<*> = 



(l-y)hf 



(17) 



where h ( ^(i) = l/£/y/(*) if y f (fc) = 1 and h ( ^(0 = 0, otherwise; ^^(1) = 1, 
if i = k, y G [0, 1] controls how much the random walker prefers to go to the 
data subgraph Gx and other two subgraphs Gc , Qa- It can be verified that 
^2 i h ( - k \i)=l, i.e, h*^ is a probability distribution. Let Ik be the identity 
matrix of size KxK,wq write 



// = [h ! 



-[ 



yH x 

(I-Y)Ik 



(18) 



PRW on TG: given the TG of a dataset, using the transition matrix M 
defined in Equation {7J and the preferential probability matrix H defined in 
Equation {l8j, we can perform PRW on the TG. According to Equation (6)> 
its equilibrium distribution matrix P* is computed as: 



P* = a [I-(l -a)M T Y l H, 



(19) 



P* = [p* v ---,p* K ]eR (n+K)xK , and p* is the equilibrium distribution of the 
PRW taking the k-th semantic group as preference. Therefore, p£(0 (/ + 
l^i^n) measures the relevance between the ^-th class and an unlabeled 
image bag x/. We can predict the stage class from the sub-matrix P* c by 
Equation (f20l and annotate the controlled term s for x/ using the adaptive 
decision boundary method JWang et~al[ l2009h on the submatrix P* a by 
Equation f2H . 



p* _ 



P*(/ + l,l) ••• P*(l + 1,K C ) 



P*(n,l) 



P*(n,K c ) 



'P*(l + 1,K C + 1) P*{l+\,K C +Ka)' 



P*{n,K c + \) 



P*(n,K c + K a ) 



(20) 



(21) 



Since the stage prediction is a single-label classification problem, we select 
the stage label y(i) c * with the maximum probability as the stage label for 
image bag x/ . 



y(i) c * = argmax(p^), Vi = / + 1 , / + 2, . . . , n. 

k 



(22) 



where p^ is the i-th row vector of matrix P* c . Therefore, we can do the stage 
classification and anatomical controlled term annotation simultaneously. 



Similarly, let df = J2j Wa0'J)> the annotation label subgraph intra 
transition probability from a z - to ay is: 



p(aj\ai)=M A (i,j) = 



(i-fc-W A (/j)/jf, if df a >o (15) 

Wa (i J) /df , otherwise 



let df = J2jWc(i,j), the classification label subgraph intra transition 
probability from c z to Cj is: 



(l-p l -X)W c (iJ)/df,ifdf c >0 (16) 
Wc(iJ)/df , otherwise 



p(Cj\Ci)=M c (iJ) = 
It can be easily verified that, J^-M(iJ) = 1, i.e. M is a stochastic matrix, 



4 DATA REFINEMENT 

In this section, we will introduce the details of the data used in 
our experiment. Drosophila embryogenesis has 17 stages, which 
are divided into 6 major ranges, i.e. stages 1-3, 4-6, 7-8, 9-10, 
11-12 and 13-16 (th e stage 17 is usually stu died individually), in 
the BDGP database (iTomancak et al Each image bag is 

labeled with one stage term and many controlled vocabulary terms. 
The total number of anatomical controlled vocabulary terms is 303. 

We used the following way to refine the dataset. First, we only 
keep the image bag data with lateral, dorsal and ventral view 
information. And then, we eliminate six common annotation terms, 
that is, 'no staining, ubiquitous, strong ubiquitous, faint ubiquitous, 
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maternal, rapidly degraded', which can be regarded as outliers 
because they can neither provide stage- specific information nor 
record anatomical structures. After that, we remove the anatomical 
terms whose data sample is <50. We ignore the stage 1-3 data since 
the number of anatomical terms after the above procedure becomes 
2, too small to be compared with other stages. And finally we get 79 
anatomical annotation terms in total that we will consider to annotate 
the unlabeled image bag. 

We refine the data mainly based on the following two reasons. 
On one hand, the annotation terms which appear in too few image 
bags are statistically too weak to be learned effectively. On the other 
hand, since we will use 5-fold cross-validations in our experiments, 
we have to guarantee there is at least one data point associated with 
each anatomical term in each fold. Moreover, in order to balance the 
number of image bags for different stages, we randomly sample 500 
image bags as the data points for each stage. At last, the summary 
of the refined dataset is shown in Table Q] 



5 EXPERIMENT 

In this section, we will conduct experiments to evaluate PRW 
empirically on the refined dataset and compare it with other 
state-of-art classification methods. Since our method can do joint 
classification, in order to evaluate the benefit of joint learning, we 
compare its performance with that of the state-of-art multiclass 
single label or multiclass multilabel algorithms which can only 
handle either stage classification or anatomical term annotation 
problem. Our procedure is to train our model with stage labeled 
and anatomical term annotated image bags. All testing image 
bags are unlabeled with developmental stage and unannotated with 
anatomical controlled terms. 



5.1 Experimental setup 

When constructing PRW on TG, we used kNN graph setting k = 9. 
We used 'inverse' 5-fold cross-validation to determine the values 
of the following five parameters, that is, using 1-fold for training 
and using the remaining 4-folds for testing to mimic the scenario 
in the real application where the number of training data is much 
less than the testing data. In our experiment, we found that the 
following five parameters are not sensitive in certain ranges with 
good performances. fa and X controls the jumping between 
different subgraphs and cannot affect the result much if they are 
assigned in the range of (0.1,0.45). a controls initial preference of 
the random walker and will get stable result if it is assigned in the 
range of (0,0.1). y controls how much the random walker prefers 
to go to the data subgraph or to go to two other subgraphs and it is 
usually in the range of (0.1,0.3). 

Besides those parameters, we also need to initialize the stage 
as well as anatomical controlled terms for the testing image bag 
X/, where i = Z + l,...,n, / is the number of training image bag. In 
our experiment, we used & -nearest neighbor (KNN) method to do 
the initializations for both stage classification and anatomical term 
annotations tasks because of its simplicity and clear intuition. To 
be specific, we use k = l and we abbreviate it as INN. Our joint 
classification framework will self-consistently amend the incorrect 
labels for stage and controlled terms. We perform 10 random splits of 
the data and report the average performance over the 10 trials. Please 



note that, in each trial, we still do 'inverse' 5-fold cross validation 
and record the average performance result as the result of that trial. 

5.2 Image bag stage classification 

Drosophila gene expression pattern stage categorization is a single- 
label multi-class problem. We compare the result of our method with 
that of suppor t vector machine (SV M) with radial basis function 
(RBF) kernel dChang and Lmll200ll) . We use the optimal parameter 
values for C and y got from cross-validation as well. We also 
compare the classification result of INN that we use to do the 
initialization. We assess the classification in terms of the average 
classification accuracy and the average confusion matrices. Since the 
data that we used is class balanced, the mean value of the entries on 
the diagonal of the confusion matrix is also the average classification 
accuracy. From the resulting average confusion matrices shown in 
Figure \5\ we can see that the average prediction accuracy of our 
method is better than that of the other two state-of-art methods, 
especially in the last stage 13-16, where the number of anatomical 
terms is greatly larger than that of the other stages. 

5.3 Image bag controlled vocabulary terms annotation 

Besides the stage classification task, we also validate our method 
by predicting the anatomical controlled terms for the Drosophila 
gene expression patterns, which can be considered as a multi-class 
multi-label classification problem. The conventional classification 
performance metrics in statistical learning, precision and Fl score, 
are utilized to evaluate the proposed methods. For every anatomical 
term, the precision and Fl score are computed following the standard 
definition for the binary cl assification problem. To ad d ress th e multi- 
label scenario, following iTsoumakas and Vlahavasl (l2007h . macro 
and micro average of precision and Fl score are used to assess the 
overall performance across multiple labels. We compared four state 
of art mu l ti-labe l classification methods: local shared sub space (LS) 
dji et all [2008b, local regression (LR) dji et all [2009L harmonic 
function (HF) fehu et all l2003h and random walk fRW) f Zhou 
and Scholkopf. 120041) . All of them are proposed recently to solve 
the multilabel annotation problem. In addition, we compare the 
results of INN as well. For the first three methods we use the 
published codes posted on the corresponding author's websites. And 
we implement the RW method following the original work ( Zhou 
and Scholkopf" T2004l) . For HF and RW methods, we follow the 
original work to solve the multilabel annotation only. Therefore, we 
only evaluate those two methods on data subgraph and annotation 
label subgraph without using any information derived from the 
classification label subgraph such as the stage-term correlation. 
Table [2] shows the average anatomical annotation performance of 
79-term dataset. Compared to the above five stat-of-the-art methods, 
our method has the best results by all metrics. Figure [6] illustrates 
the average Micro Fl score of our method, INN, LS, LR, RW and 
HF approaches for all the anatomical terms on 79-term dataset. And 
again, our method consistently achieves best performance for most 
of the anatomical controlled terms. 

5.4 The advantage of joint learning 

Unlike the traditional work, our proposed method can take advantage 
of all the information to do the stage classification and anatomical 
term annotation simultaneously. Therefore, when the number of 
training data is scare, we can resort to both intrarelations and 
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Fig. 4. The middle part demonstrates the terms-stages correlation and the right part shows the terms-terms correlation of 79 terms. The stage unknown test 
data shown in the left part is classified correctly as Stage 13-16, because of the strong correlation between the predicted stage and its predicted anatomical 
terms and vice versa, NOT the similarity of its first and second nearest neighboring data induced from the data graph only 



Table 2. Annotation prediction performance comparison 
on the 7 9 -term dataset 



Method 


Ma Pre 


Ma Fl 


Mi Pre 


Mi Fl 


INN 


0.3455 


0.3595 


0.2318 


0.2230 


LS 


0.5640 


0.3778 


0.3516 


0.1903 


LR 


0.6049 


0.4425 


0.3953 


0.2243 


RW 


0.4019 


0.3385 


0.2808 


0.1835 


HF 


0.3727 


0.3296 


0.2756 


0.1733 


Our method 


0.6125 


0.4434 


0.4057 


0.2336 



Ma Pre, Avg. Macro Precision; Ma Fl, Avg. Macro Fl; Mi Pre, 
Avg. Micro Precision; Mi Fl, Avg. Micro Fl. 
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Fig. 5. Stage classification results in terms of confusion matrices on 79-term 
dataset: (a) the confusion matrix calculated by SVM (b) the confusion matrix 
calculated by INN. (c) the confusion matrix calculated by our method, (a) 
SVM: acc. 84.50%; (b) INN: acc. 77.40%; (c) our: acc. 85.20% 



interrelations to make the decision for stage classification and 
anatomical controlled term annotation simultaneously. When there 
are strong correlation between those two tasks, we expect that 
the performance of both tasks will be enhanced by joint learning 
work than treating them individually and independently. Figure |4] 
shows the pairwise label correlations of the 79 terms and stage-term 
correlations between 5 stages and 79 terms. As highlighted by purple 
arrows, we can observe that there are high pairwise correlations 
between the terms 'embryonic brain', 'ventral nerve cord' as well 
as 'embryonic/larval muscle system'. Moreover, all the above three 
terms have high correlations with the stage 13-16, which can provide 
strong evidence that the given testing image bag could belong to 
the last developmental stage besides the induction from the data 
graph only. If our joint classification framework annotates it with all 
those three terms, although from the data similarity we cannot get 



strong evidence for the stage prediction, we can take advantage of 
the term-term as well as term-stage high correlations to adjust its 
stage to stage 13-16. In other words, relevant anatomical terms could 
help us to predict the stage label since they provide the spatial and 
temporal information of local structure corresponding to a specific 
embryo development stage. Nevertheless, not all anatomical terms 
will definitely benefit stage classification, which is consistent with 
our stage classification result. From Figure \5\ we can see that our 
method may have competitive result compared with SVM with 
respect to some certain stage. However, given more anatomical 
term information, the performance of our method will gradually 
outperform the other methods, especially for the prediction result of 
stage 13-16. 
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Fig. 6. The Avg. Micro Fl score of five methods on each term in 79-term dataset. (It is better to be viewed in colorful and zoomed in mode.) 



5.5 The more meaningful asymmetric correlation 
matrix 

When we build the TG, at first, we assume the term-term correlation 
and stage-term correlation are both symmetric, since we used 
cosine similarity to represent their correlations. However, the above 
assumption does not always hold in the real data. In Drosophila 
embryo gene expression images, we found that the conditional 
probability of the occurrence of term 'ventral nerve cord' given term 
'embryonic brain' is higher than that of the 'embryonic brain' given 
'ventral nerve cord' , which satisfies the biology meaning that ventral 
nerve cord occurs earlier than embryonic brain. After learning, our 
method can automatically discover the above hidden asymmetric 
correlation information, that is, 



r aa 



'P*(n+K c + \,K c + \) P*(n+K c + 1,K)' 



P*(n+K,K c + l) 



P*(n+K,K) 



(23) 



In order to see the learned asymmetric term-term correlation 
more clearly, in Figure [7] we show the difference matrix got by 
Paa~Paa T - Taking those more accurate asymmetric correlation 
into consideration, our method can potentially improve both stage 
classification and anatomical annotation results even more. 



6 CONCLUSION 

In this article, we proposed a novel TG model to learn the 
task interrelations between stage recognition and anatomical terms 
annotation of Drosophila gene expression patterns. The standard 
bag-of-word features and three major views (lateral, dorsal and 
ventral) were used to describe the 3D Drosophila images. A new 
PRW method was introduced to simultaneously propagate the 
stage labels and anatomical controlled terms via TG model. Both 
stage classification and anatomical controlled term annotation tasks 
are jointly completed. We evaluated the proposed method using one 
refined BDGP dataset. The experimental results demonstrated in 
the real application, when the number of training data is scarce, 
our joint learning method can achieve superior prediction results on 
both tasks than the state-of-the-art methods. What is more, we can 
discovery more accurate asymmetric term-term correlation, which 
can potentially improve the results of both tasks even more. 




Fig. 7. The learned difference matrix. (It is better to be viewed in colorful and 
zoomed in mode.) In order to see the asymmetric entries more clearly, we plot 
p ta~ P aa • After PRW, the entries marked as brighter square have higher 
conditional probability (positive correlation) than its counterpart which is 
marked as darker color. This asymmetric reflects more accurate term-term 
correlation than the original symmetric assumption 
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